Abstract


Reinforcement Learning methods, which can learn from “evaluative”
feedback, form a viable alternative to Supervised Learning, especially
when “state-optimal action pairs” are costly or impossible to generate
(because of a lack of domain knowledge or the requirement of online
performance). Algorithms with proven convergence to optimality exist for
discrete state space models, i.e. with table lookup architectures.
Applying Reinforcement Learning to large or continuous valued state
spaces, however, necessitates the use of function approximators like
Radial Basis Networks and Neural Networks, albeit at the cost of losing
the proof of optimality.

This study investigates the applicability of radial basis function based
reinforcement learning methods in the dynamic multiagent scenario of
robot soccer. Radial Basis Function Networks are an attractive method of
function approximation for these tasks. Their advantages over the more
popular feed-forward sigmoidal Neural Networks (ANN) are that they can be
interpreted as fuzzy rule based systems, thereby offering comprehensible
rather than black-box models; they are much faster to train; they are
analytically incremental; and, being local in their response, they are
better suited to reinforcement learning tasks. Opponent modelling, a
factor so far neglected in this particular domain, has been incorporated
through the presence of opponents assumed to act optimally. The
application of RBF networks has been facilitated by expediting the recall
time through the development of a new tree structured implementation
named RBF tree.



Chapter 1 
Introduction 


This work considers a multi-agent, dynamic game scenario, where
autonomous agents have to act in real time in a fast changing, noisy
environment. Moreover, in domains like these, e.g. robot soccer, the
result of an agent’s action may become explicit only at a future time.
Learning algorithms have enabled researchers to handle these complex
domains without the need to model the optimal policy explicitly.

Learning algorithms can be divided into two broad categories: supervised
learning and reinforcement learning. Supervised learning is a methodology
by which an agent can learn from “examples”, as opposed to performing on
the basis of pre-formulated explicit policies. Supervised learning
requires an expert to generate a training set in the form of optimal
(state, action) pairs. Thus supervised learning cannot be used to learn
directly from the environment.

An alternative would be to use the feedback from the environment for the
learning task. Environments, however, seldom provide “instructions” on
the optimal actions. A more realistic scenario is one where the
environment provides an evaluation of the applied action in the form of
rewards and penalties. Such learning tasks can be addressed by the method
of Reinforcement Learning, which can learn from “experience”.

Many of the Reinforcement Learning algorithms are based on a lookup table
representation of the state. Real life problems often involve huge state
spaces, which makes table based state representation impossible. Some
compaction might be obtained by replacing tabular lookup by function
approximators, but even then the computational requirements far exceed
the available resources. A possible workaround, as introduced by Brooks
[1], is to consider the task to be composed of a hierarchy of simpler
behaviours. Such artificial, hierarchical segmentation is expected to
introduce sub-optimality, but its advantage lies in making impossible
problems tractable.

Figure 1.1: soccer server screen

1.1 Learning in soccer domain:

Robot-Soccer has emerged as an interesting and highly active domain for
several artificial intelligence paradigms, especially learning
multi-agent behaviours [2]. The current work deals with the simulation
league, where a standardised software, “soccer-server”, simulates the
physics of the game by a specified stochastic state update rule. The
participants are required to design the behaviours of individual agents,
which have to base their actions upon the state as obtained from the
soccer-server. The size of the state space at a resolution of 1 m and
1 degree is of the order of [...].¹

Learning in this domain has mainly used feed-forward neural networks in
a supervised learning role. This leads to methods which are off-line and
thus unable to adapt to opponents. Being black-box models, they are
difficult for a human to understand. The requirements of a supervised
learning algorithm are also much higher because of its need for optimal
state-action pairs.

The tasks learnt in this domain include simple kicking to an empty goal
[3] using feed-forward neural networks. The network was used to learn
when to start accelerating to hit an approaching ball into the goal.
Though the paper touches on the need for adversarial learning, where the
goal is defended by a keeper, the authors did not train a neural network
[...] non-reactive defender circling in front of the goal in a fixed
trajectory, by using memory based learning techniques. The learning
consisted of choosing between two options: shooting directly to the
centre of the goal, or passing to a team mate at a fixed position who
took the goal shot.

¹The size of the field is 100m x 65m and the total number of players is 22.

Decision tree induction (C4.5) was used by the same authors [5] to predict
whether a team-mate could receive a pass from a fixed location. The task
of unchallenged interception was learnt with the help of feed-forward
neural networks in [6]. The same decision tree inducing software, C4.5,
was used by Tambe and his co-researchers [7] to learn the optimum shooting
angle from data generated by human experts. Up to 3 kicking solutions
were considered, namely at the sides and the centre.

It is only very recently that online training has been executed in
practice [7]. A mixed offline-online strategy was used to train an
interception routine. The opponents were not considered explicitly.
However, because the training was online, the individual agents could
implicitly adapt to the behaviours of nearby opponents.

1.2 Approach 

In this work Radial Basis Function Networks were used as function
approximators for learning two low level skills, developed on the
assumption of optimally acting opponents:

1. adversarial shooting to goal. 

2. adversarial interception. 

The advantages of RBF networks include that they can be interpreted as
fuzzy rule based systems; that, having a closed form expression for the
learnt parameters, they are much faster to train than feed-forward
sigmoidal neural networks; and, more importantly, that they can be
updated online with guaranteed preservation of past training cases. The
RBF methodology gives a strong linear algebraic foundation to the field
of fuzzy inferencing, thereby introducing a number of tools for tuning
its parameters. The validity of the convergence of most Reinforcement
Learning algorithms depends on a tabular state representation and fails
to hold when function approximators are used. This is primarily because
of over-generalisation over the input space, to which ANN are typically
prone. RBF networks, being local models, are largely free of such
limitations and are more suitable for Reinforcement Learning tasks.

Figure 1.2: Left: the agent, in the light coloured circle, tries to shoot
to goal in the presence of opponents, shown with a light ring around a
dark interior. The light arrow shows the prescribed direction of
shooting. Right: the agent tries to intercept a moving ball, signified by
the light arrow, in the presence of opponents. The end of the arrow shows
the point the ball can reach in 10 time steps. The prescribed trajectory
is shown by the dark arrow.

The use of Radial basis function networks, which combine the
comprehensibility of a fuzzy rule with a probabilistic interpretation,
has not previously been applied in this domain. In previous work in this
field, opponent models have not enjoyed much research interest. The
current study not only considers the presence of opponents in these
elemental tasks, but also models them as ideal opponents. Thus the
opponents are always assumed to act optimally.

The current implementation progresses in two steps. In the first step
the agents learn the action in the presence of optimal opponents. This
idea, inspired by the mini-max criterion in game theory, cannot however
take advantage of non-optimal behaviour of opponents. Thus in the second
phase of the learning task the agents can adapt online to a particular
opponent as play commences, thereby trying implicitly to span from a
mini-max strategy to recursive, agent modelled strategies. The second
part requires a complete agent to play against an existing opponent. In
the absence of a complete agent the second part could not be implemented
in practice and is established only as a methodology.

The contributions of the work include:

• Introduction of radial basis functions in the domain of robot soccer,
leading to comprehensibility of learnt models, as shown in figure (1.2).

• Incorporation of optimal adversaries in the learning tasks.

• Expediting data recall from RBFN, by the development of a new tree
based implementation called RBF tree.

• Introduction of an analytically incremental function approximation
module enabling guaranteed preservation of past training cases.

• Introduction of online learning in robot soccer domain.

Chapter 2
Learning 


The problem faced by a learning agent in a state S is to choose, from a
set of permissible actions A, a particular instance so as to maximize a
suitably chosen performance measure. These kinds of problems can be
modelled as Markov Decision Processes (MDPs).

MDPs are defined by
a set of states: $S$
a set of actions: $A$
a set of transition probabilities $T : S \times A \times S \to P,\; P \in [0, 1]$
and a set of expected rewards:

$$R(S_t, A_t, S_{t+1}) = E[r_t]$$

where $r_t$ is the reward obtained at time t. The performance is
evaluated as $\sum_t \gamma^t r_t$, where $\gamma$ is a discount rate.
The policy $\pi(S) : S \times A \to P,\; P \in [0, 1]$ is chosen which
maximizes the performance measure. Knowing the model parameters it is
possible to find the optimal $\pi^*(S)$ by value iteration or policy
iteration [8].

In the event that the model parameters are unknown, as is the case for
most real world scenarios, different learning techniques can be used. The
two popular methods are supervised learning and reinforcement learning.

2.1 Supervised Learning 

If it is possible to obtain, at no exorbitant cost, the state to optimal
action mapping

$$S \to A^*$$

supervised learning is prescribed. This is because supervised learning
gives the required action explicitly in the form

$$A^* = f(S)$$

which can be easily utilised by the agent. The only drawback is that it
presupposes the availability of a set of “instructive” examples in the
form of state-action pairs. Such a training set, spanning an adequate
portion of the total search space, may be difficult or even impossible to
provide; in any event it would require an expert and thus would be hard
to implement online.

2.2 Reinforcement Learning 

In cases where “instructive” feedback is not available, another branch of
learning algorithms can be used, which can learn with the less demanding
availability of “evaluative” feedback from the environment. Thus
reinforcement algorithms are said to learn from experience, as opposed to
supervised learning algorithms, which learn from examples. Reinforcement
learning has the added advantage of being able to keep up with dynamic
model parameters, because it does not need instructive information and
thus can be implemented online.

Reinforcement learning however usually does not support an explicit
mapping in the form

$$A^* = f(S)$$

but produces indirect mappings like

$$S \times A \mapsto Q \quad \text{or} \quad S \mapsto V$$

in which case the agent has to search the action value function Q(S, A)
or the state value function¹ V(S) through a maximization operator
argmax Q to obtain the required action. When these are represented as
lookup tables this search is trivial, but not so when they are
represented by function approximation techniques like neural nets.

¹These quantities are defined later. Refer to equations (2.2) and (2.3).

2.2.1 Single Epoch Reinforcement Learning:

One of the difficult issues that has to be addressed by reinforcement
learning is that of temporal credit assignment. This stems from the fact
that a reward may be obtained as a consequence of an action executed far
in the past. For learning to be successful it becomes necessary to
associate the current reward with the past action.

Single epoch domains, being free from this problem, are simpler. The
much studied “k-armed bandit” [9] problem is such an instance. The
problem involves an agent trying to maximize his earnings from a k-armed
bandit gambling machine in a fixed number of attempts. Each of the arms
outputs a reward or a penalty based on a probability unknown to the
agent. Thus the agent has to balance exploration with exploitation. We
will find the goal shooting task discussed in section [5] to be a similar
problem.

The action value function based approach characterizes each action $A_i$
with a value $Q(A_i)$ signifying the average expected reward.

The optimal action would thus be to choose the action with the highest
action value.

$$A^* = \arg\max_{A} Q(A)$$

However the actual action value Q is unknown to the agent, and he has to
base his decision on an estimate $\hat{Q}$ which he has to learn. A simple
strategy would be to take the average of the observed rewards [8].

$$\hat{Q}_t = \frac{1}{t}\sum_{i=1}^{t} r_i = \hat{Q}_{t-1} + \frac{1}{t}\,(r_t - \hat{Q}_{t-1}) \qquad (2.1)$$

Thus the previous estimate is updated by the new observation with a
gradually reducing step length equal to $1/t$.
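The running-average update of equation (2.1) can be sketched in a few lines. This is only an illustration under stated assumptions: the Bernoulli reward model, the epsilon-greedy exploration rule and the names `update_estimate` and `run_bandit` are hypothetical choices made here, not part of the original work.

```python
import random


def update_estimate(q_prev: float, reward: float, t: int) -> float:
    """Incremental sample average, eq. (2.1): Q_t = Q_{t-1} + (r_t - Q_{t-1}) / t."""
    return q_prev + (reward - q_prev) / t


def run_bandit(win_probs, pulls=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy play on a k-armed bandit with 0/1 rewards (assumed model)."""
    rng = random.Random(seed)
    k = len(win_probs)
    q = [0.0] * k       # estimated action values, one per arm
    counts = [0] * k    # number of times each arm has been pulled
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                     # explore
        else:
            arm = max(range(k), key=lambda a: q[a])    # exploit
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        q[arm] = update_estimate(q[arm], reward, counts[arm])
    return q, counts


estimates, pulls_per_arm = run_bandit([0.2, 0.5, 0.8])
```

Each arm keeps its own counter, so the step length 1/t shrinks separately for every action.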

2.2.2 Infinite/Multi epoch Reinforcement learning:

The quantity to be optimised for an infinite epoch task is modelled as

$$\text{Return} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad 0 < \gamma < 1$$

where the discount factor $\gamma$ is a mathematical means of bounding the
return, and also a means of modelling the intuitive fact that a reward
far in the future is worth less than an immediate reward. For memory-less
dynamics it is possible to model the task as an MDP, as stated earlier in
this section.

The same value function based approach can be applied, but now the
action value function Q, in addition to the action A, will also depend on
the current state S. Moreover we can also define a state value function
V, referred to simply as the value function. These quantities are defined
as

$$V^{\pi}(S) = E\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, S_t = S,\ \pi = \pi(S)\Big] \qquad (2.2)$$

this is the expected total discounted reward for visiting state S when
following policy $\pi$.

$$Q^{\pi}(S, A) = E\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, S_t = S,\ A_t = A,\ \pi = \pi(S)\Big] \qquad (2.3)$$

this is the expected total discounted reward of choosing action A in
state S and following policy $\pi$.

These state and action value functions follow the recursive relations
shown below.

$$V(S) = \sum_{A} \pi(S, A) \sum_{S'} T(S, A, S')\,[R(S, A, S') + \gamma V(S')] \qquad (2.4)$$

$$Q(S, A) = \sum_{S'} T(S, A, S')\,[R(S, A, S') + \gamma V(S')] \qquad (2.5)$$

Here S and S' refer to $S_t$ and $S_{t+1}$ respectively. These equations
can be used to iteratively update the estimates after they have been
initialised randomly, forming the key element of value iteration and
policy iteration. This is basically an iterative solution of a system of
linear equations. These methods however require the model dynamics to be
known, as T(S, A, S') and R(S, A, S') occur explicitly in the
formulations.
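As a rough illustration of how relations (2.4) and (2.5) are applied iteratively when the model is known, the sketch below evaluates a fixed policy on a tiny hand-made MDP. The nested-list layout `T[s][a][s2]`, the two-state example and the function names are assumptions made for this example only. The estimates start at zero rather than at random values, which does not change the fixed point for a discount factor below one.

```python
def evaluate_policy(T, R, pi, gamma=0.9, sweeps=500):
    """Apply eq. (2.4) repeatedly: V(S) = sum_A pi(S,A) sum_S' T [R + gamma V(S')]."""
    n = len(T)
    V = [0.0] * n
    for _ in range(sweeps):
        # Synchronous backup: the new list is built entirely from the old V.
        V = [sum(pi[s][a] * sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                for s2 in range(n))
                 for a in range(len(T[s])))
             for s in range(n)]
    return V


def action_values(T, R, V, gamma=0.9):
    """Eq. (2.5): Q(S, A) = sum_S' T(S,A,S') [R(S,A,S') + gamma V(S')]."""
    n = len(T)
    return [[sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(n))
             for a in range(len(T[s]))]
            for s in range(n)]


# Hypothetical two-state, two-action model: landing in state 1 pays a reward of 1.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniformly random policy

V = evaluate_policy(T, R, pi)
Q = action_values(T, R, V)
```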

So there are two approaches to reinforcement learning. One category is
the model based methods, like certainty equivalent methods [10], Dyna
[11], prioritised sweeping [12] and RTDP [13]. These use various methods
to estimate R(S, A, S') and T(S, A, S') and solve for the value functions
using equations (2.4) or (2.5). Model free methods, on the other hand,
can work without knowledge of the system dynamics, e.g. Actor Critic
Methods [14], Q-learning [15] etc.

Actor Critic Methods consist of two components, one learning the value
function V(S) and the other learning the policy $\pi$. On observing an
experience tuple $\langle S_t, A_t, r_t, S_{t+1} \rangle$ the value
function is updated as

$$V(S_t) \leftarrow V(S_t) + \alpha\,[r_t + \gamma V(S_{t+1}) - V(S_t)]$$

This is another version of equation (2.4) with the summing over the
probabilities removed. If the agent explores the environment visiting
each state infinitely often, following the transition probability
T(S, A, S'), they converge to the same value.

A positive value of $r_t + \gamma V(S_{t+1}) - V(S_t)$ implies a
preferred action; depending on this value, the policy $\pi$ is updated. A
possible way being

$$\pi(S_t) = \arg\max_{A} Q(S_t, A)$$

$$Q(S_t, A) = Q(S_t, A) + \beta\,[r_t + \gamma V(S_{t+1}) - V(S_t)]$$

Q-learning on the other hand does not need to learn the value function
and works solely on the basis of action values. The update equation for
which is given as

$$Q(S_t, A_t) = Q(S_t, A_t) + \alpha\,[r_t + \gamma \max_{A} Q(S_{t+1}, A) - Q(S_t, A_t)] \qquad (2.7)$$
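For a lookup table representation, one application of update (2.7) might look like the following minimal sketch. The dictionary keyed by (state, action) pairs, the state and action labels and the default step size are illustrative assumptions, not details taken from the original work.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, eq. (2.7), on the experience (s, a, r, s_next)."""
    best_next = max(Q[(s_next, b)] for b in actions)       # max_A Q(S_{t+1}, A)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical usage: unseen (state, action) pairs default to a value of 0.
Q = defaultdict(float)
actions = ["left", "right"]
q_update(Q, "s0", "right", 1.0, "s1", actions)
```

Because only the maximum over next actions is used, no separate state value function has to be stored.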

2.2.3 Using Function Approximators: 

All the algorithms discussed so far assumed that we can identify each
state separately and update the value functions accordingly. Proofs of
convergence of the above methods exist under such assumptions [16].

However, practical implementations, especially for huge and continuous
valued state spaces like robot soccer, become impossible with table
lookup methodologies, owing to the enormous memory requirement. Possible
avenues involve

• partitioning the state space into aggregated blocks, or 

• using function approximators. 

Function approximators, having a long history of wide and in-depth study,
are the most tempting alternative. Both of these approaches however do
the same thing, that is, generalise across the state space. Crisp
partitioning of the input state explicitly considers some distance value
within which the state value functions are assumed to remain constant. On
the other hand, for function approximators this fixing of the distance
measure is generally not explicit.

Convergence proofs do not remain valid when function approximation
techniques are used. Pathological cases have been put forward where value
function approximation errors have grown arbitrarily large even for
simple toy problems [17]. Though convergence is not guaranteed, there
have also been successful demonstrations [18]. The main difficulty arises
from being unable to control the distance over which generalisation takes
place. Neural networks are especially susceptible to this drawback, as
their parameters have a nonlocal effect on the learning process. In spite
of the absence of convergence proofs, the current work will use Radial
basis function networks as the function approximating module.

Domains with large state spaces but a small discrete set of actions can
use separate function approximators for separate actions, or a function
approximator having a separate output for each action. In all these cases
the input to the function learner is the state only. However, if the
action itself is continuous valued or if the action set is very large,
the above procedure becomes impractical. In such cases it becomes
necessary to include the action also as an input. The function then has
the form $S \times A \to Q$.

Selecting the action then becomes a search executed on this learnt
function. Given the current state $S_t$, the function $Q(S_t, A)$ has to
be searched to find the action with the highest action value. For
function learning methods this involves non-trivial computation, as
opposed to cases where separate actions have separate function learners.
The standard procedure is to do a gradient ascent on the learnt function.
Baird's [19] wire-fitting function approximator gives an example where
this search can be greatly simplified.

Here an attempt will be made to execute the primal phase as a supervised
learning process by approximating the $S \to A^*$ mapping by radial basis
functions. The form $S \times A \to Q$ will also be learned and will be
kept relevant in a non-stationary domain with the help of online
evaluative feedback, a task difficult if $S \to A^*$ is the only
representation that is maintained. The necessary examples or the training
set will be generated with the help of multiple simulations.

Chapter 3

Function approximation with
Radial Basis Function networks


Radial basis function networks are a class of universal function
approximators [20] which have a strong resemblance to distance weighted
regression or memory based approximations to functions. Unlike sigmoidal
feed-forward networks, radial basis function networks consist of a number
of locally approximating units, each spanning a suitable amount of the
input space. The outputs of these local units are combined to give the
final approximation.

The advantages over feed-forward sigmoidal networks include that the
network can be interpreted as a fuzzy logic rule based system, thereby
offering a lot of comprehensibility; but perhaps the greatest advantage
lies in the ease with which it can be trained and the fact that the error
has only a single minimum, which is obviously global¹. Radial Basis
function networks have a closed form solution for the trained weights,
thereby eliminating time consuming iterative methods. Moreover these
units try to approximate the function locally, thus any change in the
network will affect only a small part of the input space, which makes it
possible to edit it incrementally. Moreover these models have theoretical
formulations for incremental operation, where new data can be added
without disturbing past learnt experiences.

¹This is for the case where only the weights are considered as modifiable
parameters.

Figure 3.1: Radial Basis Function Network

3.1 Linear combination of Radial basis vectors

A function of the form f(x, c), where x is the independent variable and c
is a constant vector, is called a Radial Basis Function when it depends
only on some distance metric between the vector x and the centre c. The
most common form of distance metric used is

$$r = [x - c]^T [\Sigma]^{-1} [x - c] \qquad (3.1)$$

Here the width matrix $[\Sigma]^{-1}$ is generally taken to be diagonal.
This results in the separation of the input space into a number of
hyper-ellipsoids as shown in fig [3.2]. Depending on whether
$[\Sigma]^{-1}$ is diagonal or not, the major axes of these ellipsoids are
aligned with, or at an angle to, the axes of the input vectors.
Non-diagonal matrices naturally offer more flexibility, but at the cost
of more parameters to tune. This can be conceived as a 4 layer network,
i.e. one having an extra linear preprocessing layer whose output is given
as

$$X = [W]\,[x - c]$$

Thus the new metric now gives

$$r = [X]^T [\Sigma]^{-1} [X]$$

$$r = [x - c]^T [W]^T [\Sigma]^{-1} [W]\,[x - c]$$

Many forms of the function have been used (multi-quadric, inverse
multi-quadric, thin plate splines) but the most popular is the Gaussian
form.
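A single Gaussian unit built on the metric of equation (3.1) can be sketched as below for the diagonal case. The argument `widths` holds the diagonal entries of the width matrix (so each term is divided by one entry), and the exp(-r/2) scaling is one common convention. Both the function names and this scaling are assumptions of the sketch.

```python
import math


def squared_distance(x, c, widths):
    """Eq. (3.1) with a diagonal width matrix: sum_k (x_k - c_k)^2 / widths_k."""
    return sum((xk - ck) ** 2 / wk for xk, ck, wk in zip(x, c, widths))


def gaussian_rbf(x, c, widths):
    """Response of one Gaussian radial basis unit, exp(-r / 2)."""
    return math.exp(-0.5 * squared_distance(x, c, widths))


# The response is 1 at the centre and decays with the scaled distance from it.
at_centre = gaussian_rbf([0.5, -1.0], [0.5, -1.0], [2.0, 3.0])
away = gaussian_rbf([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])
```

Unequal entries in `widths` stretch the contours of constant response into axis-aligned ellipsoids.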

An RBF network consists of a number of such radial basis function units,
whose outputs are combined linearly to give the final output. Thus the
output is given as

$$\hat{y} = \sum_{j=1}^{n} w_j f_j(x) \qquad (3.2)$$

Given a training set of N cases, each consisting of an input vector
$x_i \in \mathbb{R}^d$ and the desired output $y_i$, we find the values of
$w_j$ which minimize the error

$$E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (3.3)$$

Considering the centres $c_j$ to be known, this minimization is linear in
the parameters to be determined. Writing $f_{ij} = f_j(x_i)$, the output
of the RBF net can be written as

$$\begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_N \end{bmatrix} = \begin{bmatrix} f_{11} & \cdots & f_{1n} \\ \vdots & & \vdots \\ f_{N1} & \cdots & f_{Nn} \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}$$

or

$$\hat{Y} = F W$$

Thus essentially the radial basis network converts a feature vector in
$\mathbb{R}^d$ to a vector in $\mathbb{R}^n$, usually of higher dimension,
whose components are linearly combined in the ratio given by the vector W
to give the necessary approximation.

The closed form expression of the weights W can be obtained by elementary
linear algebra as

$$W = (F^T F)^{-1} F^T Y \qquad (3.4)$$

This solution is unique and is the global minimum of the error.
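The closed form (3.4) can be tried out on a toy problem as sketched below. To stay free of external libraries, the normal equations are solved with a small Gaussian elimination routine instead of forming the inverse explicitly. The scalar inputs, the shared width and all function names are assumptions made for this illustration.

```python
import math


def design_matrix(xs, centres, width):
    """F[i][j]: Gaussian response of unit j (centre centres[j]) to scalar input xs[i]."""
    return [[math.exp(-0.5 * (x - c) ** 2 / width) for c in centres] for x in xs]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    w = [0.0] * n
    for i in reversed(range(n)):                    # back substitution
        w[i] = (M[i][n] - sum(M[i][k] * w[k] for k in range(i + 1, n))) / M[i][i]
    return w


def fit_weights(F, y):
    """Eq. (3.4), obtained by solving the normal equations (F^T F) W = F^T Y."""
    m = len(F[0])
    FtF = [[sum(row[i] * row[j] for row in F) for j in range(m)] for i in range(m)]
    Fty = [sum(row[i] * yi for row, yi in zip(F, y)) for i in range(m)]
    return solve(FtF, Fty)


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
F = design_matrix(xs, centres=[0.0, 1.0, 2.0], width=0.5)
targets = [sum(f * w for f, w in zip(row, [1.0, -2.0, 0.5])) for row in F]
weights = fit_weights(F, targets)
```

Since the targets here are generated from known weights, the fit should recover them up to rounding error.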

It is also possible to implement a normalized form of RBF nets where

$$\hat{y} = \sum_{j=1}^{n} w_j \frac{f_j}{\sum_{k=1}^{n} f_k} = \frac{\sum_{j=1}^{n} w_j f_j}{\sum_{j=1}^{n} f_j} \qquad (3.5)$$

This form is utilized as it can be interpreted as a fuzzy inference engine
as well as in a Bayesian framework, as will be shown shortly.


3.2 Different error measures 

3.2.1 Ridge regression: 

The error criterion considered for minimization in equation (3.3) does
not capture all the qualities desired of a network that generalizes well.
One such requirement is that the weight terms do not grow large, which
usually results in over-fitting of the data and roughness of the learnt
surface. This criterion can be encoded in the form of a penalty term

$$E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \sum_{j=1}^{n} \lambda_j w_j^2 \qquad (3.6)$$

$$E = [y - \hat{y}]^T [y - \hat{y}] + [w]^T\,\mathrm{diag}[\lambda]\,[w]$$

The modified weights are given by

$$W = [F^T F + \mathrm{diag}[\lambda]]^{-1} F^T [Y]$$

The factors $\lambda_j$ are called the ridge parameters and help in
improving the condition number of $F^T F$, thereby making the inversion
more stable.
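The effect of the ridge parameters can be seen in a deliberately small sketch restricted to two basis functions, where the 2 x 2 inverse can be written out by hand. The restriction to two units and the name `ridge_weights_2` are assumptions made to keep the example short.

```python
def ridge_weights_2(F, y, lams=(0.0, 0.0)):
    """W = (F^T F + diag(lams))^{-1} F^T Y for exactly two basis functions."""
    a = sum(r[0] * r[0] for r in F) + lams[0]      # (F^T F)[0][0] + lambda_1
    b = sum(r[0] * r[1] for r in F)                # off-diagonal entry of F^T F
    d = sum(r[1] * r[1] for r in F) + lams[1]      # (F^T F)[1][1] + lambda_2
    g0 = sum(r[0] * yi for r, yi in zip(F, y))     # F^T Y, first entry
    g1 = sum(r[1] * yi for r, yi in zip(F, y))     # F^T Y, second entry
    det = a * d - b * b
    # Explicit inverse of the symmetric matrix [[a, b], [b, d]].
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


F = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 4.0]
plain = ridge_weights_2(F, y)                      # no penalty
shrunk = ridge_weights_2(F, y, lams=(1.0, 1.0))    # penalised weights are smaller
```

With zero ridge parameters the function reduces to the ordinary least squares solution for two units.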

3.2.2 Weighted error criteria: 

The error function (3.3) gives equal importance to all the training
cases. Sometimes it is necessary to weight the instances differently.
Consider a typical case of learning the desired action $A^*$ which
maximizes the action value $Q(A)$,

$$A^* = \arg\max_{A} Q(A)$$

The previous error term (3.3) will penalise the difference between the
optimum action $A^*$ and the predicted action $\hat{A}^*$. However we
should be more concerned with the loss in actual performance that results
from the error in action selection, i.e.

$$e = Q(A^*) - Q(\hat{A}^*)$$

By expanding about the optimum action we get

$$e = Q(A^*) - Q(A^*) - \left.\frac{dQ}{dA}\right|_{A^*} \Delta A - \frac{1}{2} \left.\frac{d^2Q}{dA^2}\right|_{A^*} \Delta A^2 + O(\Delta A^3)$$

where $\Delta A = \hat{A}^* - A^*$. The first order term vanishes because
$dQ/dA = 0$ at the optimum. We want to minimize the total magnitude of
this error over the training set $1 \le k \le N$. We note that
$\Delta A^2$ is always positive, thus it is enough to minimize

$$\sum_{k=1}^{N} \left| \left.\frac{d^2Q}{dA^2}\right|_{A_k^*} \right| \Delta A_k^2$$

Considering $\mathrm{abs}\!\left(\left.\frac{d^2Q}{dA^2}\right|_{A_k^*}\right) = s_k$
we get the modified error term as

$$E = [A^* - \hat{A}]^T S\,[A^* - \hat{A}]$$

where S is $\mathrm{diag}(s_k)$. The solution of weights for which can be
found to be

$$W = [F^T S F]^{-1} F^T S A^*$$
3.2.3 Cross-validation 

In addition to minimizing a suitable error criterion over the training
set, it is of vital importance to check how the learnt function
approximator generalizes to cases not present in the training set. The
usual method of estimating the generalisation error is to divide the
available data into a training set and a validation set. The function is
trained using data only from the training set and then tested over the
validation cases.

It therefore becomes necessary to consider how to partition the available
data into two disjoint sets. Moreover, keeping data for validation
implies a missed opportunity for training. When obtaining data is costly
this can be a significant loss.

A procedure to deal with this is leave-one-out cross-validation (LOO). A
training set of N cases is split into N - 1 training cases and one
testing case. This partition can be done in N different ways. The LOO
cross-validation error is the average of these testing errors.

Radial Basis Networks, being linear models, enjoy a closed form solution
of the LOO cross-validation error in terms of the projection matrix P.

$$\sigma^2_{LOO} = \frac{Y^T P^T (\mathrm{diag}(P))^{-2} P\,Y}{N}$$

where the projection matrix P is related to the error vector as

$$E = [Y - \hat{Y}] = Y - FW = Y - F A^{-1} F^T Y = [I - F A^{-1} F^T]\,Y = P\,Y$$

where $A = F^T F$.

A common simplification to $\sigma^2_{LOO}$ is to replace the diagonal
matrix diag(P) with another diagonal matrix having all its elements equal
to the average of those in diag(P). This simplified criterion is called
generalized cross-validation and is given as

$$\sigma^2_{GCV} = \frac{N\,Y^T P^T P\,Y}{(\mathrm{trace}(P))^2} = \frac{N\,E^T E}{(\mathrm{trace}(P))^2}$$
3.3 Online adaptation of RBF weight parame- 
ters 

All the methods presented so far assumed that the entire training data
was available beforehand; thus these are offline methods. However, for
reinforcement learning one has to adapt online to each new experience
$(y_{new}, x_{new})$.

The most common method is to use a line search along the negative error
gradient direction $-\nabla E$. This can however destroy previous learning
instances already encoded into the parameter values w. To prevent this
one usually takes a small step along the gradient instead of a full line
search.

$$\Delta w = -\alpha\,\nabla E\,|_{x = x_{new}}, \qquad \alpha \text{ small}$$

This however cannot utilize the new data sufficiently. Thus it is
necessary to have an incremental update rule that can preserve all the
previous instances. Radial basis function networks are one of the few
smooth function approximators for which this can be derived.

Most of the update rules of reinforcement learning follow the pattern
where the observation $(y_{new}, x_{new})$ is used to form the training
case $(y_{predicted}, x_{new})$, which is used to update the network as

$$y_t = y_{t-1} + \alpha(t)\,[y_{predicted}(y_{new}) - y_{t-1}]$$

If the new training instance happens to coincide with a previous instance
$(y_{old}, x_{old})$ one can simply replace $y_{old}$ by $y_{predicted}$,
and obtain the weights by making changes only in the Y matrix. Otherwise
we have to induct it as a new training instance. This will add another
row to the matrix F, giving the new matrix as

$$F_{new} = \begin{bmatrix} F_{old} \\ \Phi^T \end{bmatrix}$$

where $\Phi_j = f_j(x_{new}, c_j)$. The new weights for which can be given
as

$$W_{new} = [F_{new}^T F_{new}]^{-1} F_{new}^T \begin{bmatrix} Y_{old} \\ y_{predicted} \end{bmatrix}$$

$$W_{new} = [F_{old}^T F_{old} + \Phi \Phi^T]^{-1}\,[F_{old}^T Y_{old} + \Phi\,y_{predicted}]$$

In this case the value of $F_{old}^T Y_{old}$ will already be known and
the matrix inverse on the left can be expressed in terms of the
pre-calculated inverse matrix $[F_{old}^T F_{old}]^{-1}$ using the matrix
inversion lemma [21] as

$$[F_{old}^T F_{old} + \Phi \Phi^T]^{-1} = A^{-1} - \frac{A^{-1} \Phi \Phi^T A^{-1}}{1 + \Phi^T A^{-1} \Phi}, \qquad A = F_{old}^T F_{old}$$

This saves a considerable amount of computation, especially through the
avoidance of matrix inversion, thereby reducing the complexity from
$O(n^3)$ to $O(n^2)$.
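The rank-one update from the matrix inversion lemma can be sketched as follows. The code keeps the inverse of $F^T F$ and the vector $F^T Y$ as its only state, and folds in one new training case at a time. The function names and the list-of-lists matrix layout are assumptions made for this illustration.

```python
def sherman_morrison(A_inv, phi):
    """Return (A + phi phi^T)^{-1} from a stored A^{-1}, without a new inversion."""
    m = len(phi)
    u = [sum(A_inv[i][k] * phi[k] for k in range(m)) for i in range(m)]   # A^{-1} phi
    v = [sum(phi[k] * A_inv[k][j] for k in range(m)) for j in range(m)]   # phi^T A^{-1}
    denom = 1.0 + sum(phi[k] * u[k] for k in range(m))
    return [[A_inv[i][j] - u[i] * v[j] / denom for j in range(m)] for i in range(m)]


def add_instance(A_inv, b, phi, y_new):
    """Fold one training case (phi, y_new) into the stored state; return new weights."""
    A_inv = sherman_morrison(A_inv, phi)
    b = [bi + p * y_new for bi, p in zip(b, phi)]     # F^T Y grows by phi * y_new
    m = len(b)
    w = [sum(A_inv[i][j] * b[j] for j in range(m)) for i in range(m)]
    return A_inv, b, w


# Hypothetical start: two earlier cases with F_old equal to the 2 x 2 identity
# and Y_old = [1, 0], so that (F^T F)^{-1} = I and F^T Y = [1, 0].
A_inv = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0]
A_inv, b, w = add_instance(A_inv, b, phi=[1.0, 1.0], y_new=2.0)
```

Each update costs a few matrix-vector products, which is where the quadratic rather than cubic cost comes from.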

The RBF model is analytically incremental, not only in terms of the
addition of new data but also in terms of the addition or deletion of
centres. This offers a great advantage by allowing the induction of new
centres without the need for re-tuning from the beginning.

The addition of a new centre will add a new column $f_{new}$ to the F
matrix

$$F_{new} = [\,F_{old} : f_{new}\,]$$

keeping the Y matrix unchanged.

The changed weights will be given by

$$W_{new} = (F_{new}^T F_{new})^{-1} F_{new}^T Y$$

or

$$W_{new} = \begin{bmatrix} F_{old}^T F_{old} & F_{old}^T f_{new} \\ f_{new}^T F_{old} & f_{new}^T f_{new} \end{bmatrix}^{-1} [\,F_{old} : f_{new}\,]^T\,Y$$

The inverse of the matrix on the left can be evaluated utilizing its block
partitioned structure. Note that $F_{old}^T f_{new}$ and
$f_{new}^T F_{old}$ are column and row vectors and $f_{new}^T f_{new}$ is
a scalar. The inverse of a matrix having the following structure is given
as

$$\begin{bmatrix} A & v_1 \\ v_2^T & s \end{bmatrix}^{-1} = \begin{bmatrix} X & -\dfrac{X v_1}{s} \\ -\dfrac{v_2^T X}{s} & w \end{bmatrix}$$

where $v_1$ and $v_2$ are vectors and s is a scalar. The other terms are
given as

$$X = A^{-1} + \frac{A^{-1} v_1 v_2^T A^{-1}}{s - v_2^T A^{-1} v_1}$$

no new matrix inversion is necessary as $A^{-1}$ is already
pre-calculated.

$$w = \frac{1}{s}\left(1 + \frac{v_2^T X v_1}{s}\right)$$

In this case also the computational complexity is of the order $O(n^2)$.
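The block partitioned inverse can be sketched as a routine that grows a stored inverse by one row and one column. It uses the standard identity for a matrix bordered by a column $v_1$, a row $v_2^T$ and a non-zero scalar s. The function name and the list-of-lists layout are assumptions of this example.

```python
def grow_inverse(A_inv, v1, v2, s):
    """Inverse of [[A, v1], [v2^T, s]] computed from the stored A^{-1}."""
    m = len(v1)
    Av1 = [sum(A_inv[i][k] * v1[k] for k in range(m)) for i in range(m)]   # A^{-1} v1
    v2A = [sum(v2[k] * A_inv[k][j] for k in range(m)) for j in range(m)]   # v2^T A^{-1}
    delta = s - sum(v2[k] * Av1[k] for k in range(m))
    # X = A^{-1} + A^{-1} v1 v2^T A^{-1} / (s - v2^T A^{-1} v1)
    X = [[A_inv[i][j] + Av1[i] * v2A[j] / delta for j in range(m)] for i in range(m)]
    Xv1 = [sum(X[i][k] * v1[k] for k in range(m)) for i in range(m)]
    v2X = [sum(v2[k] * X[k][j] for k in range(m)) for j in range(m)]
    w = (1.0 + sum(v2[k] * Xv1[k] for k in range(m)) / s) / s
    top = [X[i] + [-Xv1[i] / s] for i in range(m)]     # rows [X, -X v1 / s]
    bottom = [-v2X[j] / s for j in range(m)] + [w]     # row [-v2^T X / s, w]
    return top + [bottom]


# Hypothetical example: A = diag(2, 4) bordered by v1 = v2 = [1, 1] and s = 3.
A_inv = [[0.5, 0.0], [0.0, 0.25]]
grown = grow_inverse(A_inv, [1.0, 1.0], [1.0, 1.0], 3.0)
```

Only matrix-vector products with the stored inverse are needed, so no fresh inversion takes place.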


3.4 Fuzzy reasoning 

Fuzzy logic is a popular means of approximating complex functions or
models using linguistic rules. Fuzzy systems have been used extensively
in complicated control problems [22]. Jang [23] proposed a generalised
neural network representation of a fuzzy inference system, thereby
allowing tuning of the fuzzy model by backpropagation. Optimising a fuzzy
system is an area of active research, and various methods based on
gradient descent, as in [24], and genetic algorithms [25] have been used.

A fuzzy inference system can be defined for a d-dimensional input of the
form x and an output y by a set of n rules. Each rule is defined as

$$R_j : \text{if } x_1 \text{ is } A_{1j} \text{ and } x_2 \text{ is } A_{2j} \ldots x_d \text{ is } A_{dj} \text{ then } y \text{ is } C_j$$

where $A_{ij}$ forms the antecedent associated with the $j^{th}$ rule and the $i^{th}$
component of the input vector, and $C_j$ forms the consequent for the $j^{th}$ rule.

For each of the antecedents, membership functions $\mu_{ij}(x_i)$
map the variable values to a scalar signifying the degree of membership of
the variable to the fuzzy antecedent.

The strength of firing of the rule, consisting of a conjunction of such
antecedents, is interpreted depending on how conjunction is defined in the
inference engine. Two common methods are the min operator and the product
operator. Using the product operator we get the strength of the rule as

$$\mu_j = \prod_{i=1}^{d} \mu_{ij}(x_i)$$

Depending upon the implementation, the consequent may be modelled as
a fuzzy set with its own membership functions or, as in [26], by constants. In
the latter case the final output of the fuzzy inference engine is given as

$$y = \frac{\sum_{j=1}^{n} \mu_j C_j}{\sum_{j=1}^{n} \mu_j}$$

or, if the memberships are already normalized, as

$$y = \sum_{j=1}^{n} \mu_j C_j \qquad (3.7)$$

The shape of the membership functions is free to be implemented as
desired, the simplest being triangular and trapezoidal membership
functions. For modelling highly nonlinear phenomena Gaussian membership
functions are popular. A Gaussian membership function is given as

$$\mu_j = k \exp\left( -\tfrac{1}{2} (x - c_j)^T [\Sigma_j]^{-1} (x - c_j) \right) \qquad (3.8)$$

Comparing equations (3.8) and (3.7) with equations (3.2) and (3.1), we see
that the output of each radial basis unit can be read as the
strength of firing of a rule, and the consequents can be interpreted as the
weights. This gives RBF networks another advantage over the
more common sigmoidal feed-forward neural networks: they
can be interpreted in terms of linguistic rules.
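
The inference scheme described in this section (product conjunction over the antecedents, constant consequents, normalised weighted output) can be sketched in C as below. The sketch is illustrative only: the names `triangular` and `fuzzy_output` and the array layout are assumptions, and triangular memberships are used so that the example needs no library calls. Substituting a Gaussian membership would give the RBF reading discussed above.

```c
#include <stddef.h>

/* Triangular membership centred at c with half-width w (w > 0). */
double triangular(double x, double c, double w)
{
    double a = x - c;
    if (a < 0.0) a = -a;
    return a >= w ? 0.0 : 1.0 - a / w;
}

/* Output of a fuzzy inference system with constant consequents:
 *     y = sum_j mu_j C_j / sum_j mu_j
 * centres and widths are n x d row-major arrays, one row per rule. */
double fuzzy_output(size_t n, size_t d, const double *x,
                    const double *centres, const double *widths,
                    const double *consequents)
{
    double num = 0.0, den = 0.0;
    for (size_t j = 0; j < n; ++j) {
        double mu = 1.0;                   /* product conjunction */
        for (size_t i = 0; i < d; ++i)
            mu *= triangular(x[i], centres[j * d + i], widths[j * d + i]);
        num += mu * consequents[j];
        den += mu;
    }
    return den > 0.0 ? num / den : 0.0;    /* no rule fires: return 0 */
}
```

With two one-dimensional rules centred at 0 and 1 (half-width 1) and consequents 10 and 20, an input of 0.25 fires the rules with strengths 0.75 and 0.25, so the output is their weighted mean, 12.5.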


3.5 Clustering as Antecedent induction 

As mentioned in section (3.1), the parameters of a radial basis network
consist of a centre c, which is determined a priori, and a weight w, which is
found by solving the resulting linear least squares problem. From section (3.4)
we see that the centre corresponds to the antecedent portion of the fuzzy
inference engine, whereas the weights correspond to the consequents.

Different clustering techniques used for finding the centres in
effect generate the antecedents of the rules, and can be seen as a rule
induction method. However, there is a difference in how the rules are induced
for fuzzy inference engines and for radial basis functions. In fuzzy rule based
systems the universe of discourse of a particular variable is split into some
number of linguistic sets to form the antecedents. These are then combined
exhaustively with the antecedents of other variables, resulting in a rule base
whose size grows exponentially with the number of input variables.

For radial basis function networks, it is generally pre-decided how many
rules or centres are to be used. The entire data set is divided into as many
clusters, and the centres of those are taken to be the antecedents. This
automatically allocates the location of the membership functions according to
the distribution of the data, which leads to a finer partition of regions rich
in data. If the number of rules or centres proves insufficient, more can be
added in the regions of poor approximation.

RBFs, however, are badly affected by the presence of irrelevant variables.
In that case the network has to take into account the entire space of
the variable, combined with all other antecedents, so as to learn to ignore the
effects of that particular variable.

3.5.1 Hard k-means clustering 

Clustering can be visualised as memory based learning or data compression,
where the dataset consisting of p instances of a d-dimensional vector x
is partitioned into 1 < k < p clusters, each of which has a centre in the form
of a vector $c_j$ that best resembles all the vectors allocated to the cluster j.
The degree of membership of an instance x to a cluster j is represented by
the membership function

$$\mu_j(x)$$

In hard clustering each instance of the training set can be allocated to
at most a single cluster, or in other words the membership function is
constrained to be 1 or 0.

Hard k-means clustering is started with a set of estimated centre
locations $c_{j,t=0} \in R^d$, following which these steps are executed.

1. calculate the membership of the $k^{th}$ instance to the $j^{th}$ centre

$$\mu_{j,k} = \begin{cases} 1 & \text{if } \forall i \ne j:\ \|x_k - c_j\|^2 < \|x_k - c_i\|^2 \\ 0 & \text{otherwise} \end{cases}$$

2. calculate the new estimates as

$$c_{j,t+1} = \frac{\sum_k \mu_{j,k}\, x_k}{\sum_k \mu_{j,k}}$$

3. calculate shift as

$$E = \sum_j \|c_{j,t+1} - c_{j,t}\|^2$$

4. if shift E < ε then exit, else repeat from 1

The procedure of hard k-means clustering is guaranteed to converge, but may
get trapped in local minima. To alleviate this problem we add a gradually
decreasing noise level to the centre update equation. The annealing schedule
followed sets the noise level N(t) from the ratio iteration / max iteration,
where $N_i$ and $N_f$ refer to the noise levels at the initial and final stages.
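
One pass of hard k-means can be sketched in C as follows. This is an illustrative sketch, not the code used in this work: the function name `kmeans_step` and the buffer layout are assumptions, the annealing noise is left out, and the shift is taken to be the summed squared movement of the centres, which is also an assumption.

```c
#include <stddef.h>

/* One iteration of hard k-means without annealing noise.
 * x: p x d data (row-major), c: k x d centres, updated in place,
 * sum: k x d scratch, count: k scratch.
 * Returns the shift, taken here as the summed squared centre movement. */
double kmeans_step(size_t p, size_t d, size_t k, const double *x,
                   double *c, double *sum, size_t *count)
{
    for (size_t j = 0; j < k; ++j) {
        count[j] = 0;
        for (size_t i = 0; i < d; ++i) sum[j * d + i] = 0.0;
    }

    /* Step 1: each instance belongs to its nearest centre only. */
    for (size_t n = 0; n < p; ++n) {
        size_t best = 0;
        double best_dist = 0.0;
        for (size_t j = 0; j < k; ++j) {
            double dist = 0.0;
            for (size_t i = 0; i < d; ++i) {
                double diff = x[n * d + i] - c[j * d + i];
                dist += diff * diff;
            }
            if (j == 0 || dist < best_dist) { best = j; best_dist = dist; }
        }
        count[best]++;
        for (size_t i = 0; i < d; ++i) sum[best * d + i] += x[n * d + i];
    }

    /* Steps 2 and 3: move each centre to the mean of its members. */
    double shift = 0.0;
    for (size_t j = 0; j < k; ++j) {
        if (count[j] == 0) continue;       /* empty cluster keeps its centre */
        for (size_t i = 0; i < d; ++i) {
            double updated = sum[j * d + i] / (double)count[j];
            double diff = updated - c[j * d + i];
            shift += diff * diff;
            c[j * d + i] = updated;
        }
    }
    return shift;
}
```

The caller repeats the step until the returned shift falls below the chosen tolerance, which is step 4 of the procedure.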
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3.5.2 Soft k-means clustering 

In soft clustering an instance of the training set can belong to different
clusters with different degrees of membership. The membership values satisfy
$\mu \le 1$ but need not sum to 1. Different membership functions can be formed
based on the Euclidean distance $d_{kj} = \|x_k - c_j\|^2$ of the instance from
the centre. A popular implementation is as follows [27].

1. set initial estimates of the centres $c_{j,t=0} \in R^d$

2. calculate the membership values, for a softness index m

$$\mu_{jk} = \frac{(1/d_{kj})^{1/(m-1)}}{\sum_i (1/d_{ki})^{1/(m-1)}}$$

3. calculate the new estimates as

$$c_{j,t+1} = \frac{\sum_k (\mu_{jk})^m x_k}{\sum_k (\mu_{jk})^m}$$

4. calculate error

$$E = \sum_j \|c_{j,t+1} - c_{j,t}\|^2$$

5. if error E < ε then exit, else repeat from 2

Here, as m → 1 the algorithm approaches the hard k-means algorithm, and as
m → ∞ the clustering becomes progressively softer.

3.5.3 Expectation Maximization clustering 

Expectation Maximization [28] is an algorithm to find the Bayesian maximum
likelihood estimate of a parameter (a vector in this case) when some
information is missing. Its typical use is in the estimation of the means
and variances of a mixture model from the raw data, given the number of
mixtures.

Let us assume that we have a mixture of n Gaussian processes, each
characterized by its mean vector $c_j$ and covariance matrix $[\Sigma_j]$. The data
set is generated by first choosing a particular Gaussian process j, following
which a data point is generated using the given probability density function.
The objective is to determine the maximum likelihood estimate of the
parameters $c_j$, $[\Sigma_j]$ given a data set $x_k$, k = 1 . . . N, without access to the
information about which process j was used to produce the data $x_k$.
This is solved using a Bayesian analysis as follows.
Let the prior probability that the process j is selected be given by

$$P_{centre_j} = \frac{1}{n} \quad \text{initially}$$

The probability that the data $x_k$ came from the process j can be inferred
by Bayes theorem as

$$P(centre = j \mid x_k) = \frac{p(x_k \mid centre = j)\, P_{centre_j}}{\sum_{i=1}^{n} p(x_k \mid centre = i)\, P_{centre_i}} \qquad (3.9)$$

The probability $p(x_k \mid centre = j)$ is known given the parameters $c_j$ and
$[\Sigma_j]$. This is given as

$$p(x_k \mid centre = j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left( -\tfrac{1}{2} (x_k - c_j)^T [\Sigma_j]^{-1} (x_k - c_j) \right)$$

For such a model the expectation of y given the value of x is given as

$$\hat{y}(x) = \frac{\sum_j p(x \mid centre = j)\, y_j}{\sum_j p(x \mid centre = j)}$$

The objective is to maximize the likelihood

$$\prod_{k=1}^{p} P(centre = j \mid x_k)$$

with respect to the parameters $c_j$ and $[\Sigma_j]$. The maximization is simplified
by maximizing the log of this likelihood as

$$\max \sum_{k=1}^{p} \log P(centre = j \mid x_k)$$

The algorithm can be defined in terms of 2 steps

1. Estimation: which involves calculating the probability of ownership
given the current estimate of the parameters $c_{j,t}$, $[\Sigma_{j,t}]$. This is
obtained from equation (3.9).

2. Maximization: which involves finding the new values of the parameters
$c_j$ and $[\Sigma_j]$ which maximize the log-likelihood given the current
estimate of the probability of ownership. This step reduces to a weighted
averaging process as shown

$$c_{j,t+1} = \frac{\sum_k P(centre = j \mid x_k)\, x_k}{\sum_k P(centre = j \mid x_k)}$$

The elements of the matrix $[\Sigma_j]$ are also similarly obtained. If $\sigma_{j,rc}$
denotes the element at the $r^{th}$ row and $c^{th}$ column of the matrix, it can be
evaluated as

$$\sigma_{j,rc,t+1} = \frac{\sum_k P(centre = j \mid x_k)\, (x_{k,r} - c_{j,r})(x_{k,c} - c_{j,c})}{\sum_k P(centre = j \mid x_k)}$$

In case the covariance matrix is diagonal one can directly obtain the
terms of


3.5.4 Gradient Descent Clustering: 

The centre parameters c and σ can also be obtained by operating a gradient
descent on the error function. This is facilitated by the fact that the RBF
form is analytically differentiable.

Using the error function (3.3), the gradient with respect to a parameter
$\theta_k$ of the centres is obtained as

$$\frac{\partial E}{\partial \theta_k} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial \theta_k} \qquad (3.11)$$

From the equation (3.5) we obtain

$$\frac{\partial y}{\partial \theta_k} = \frac{\left(\sum_i f_i\right) w_k \dfrac{\partial f_k}{\partial \theta_k} - \left(\sum_i w_i f_i\right) \dfrac{\partial f_k}{\partial \theta_k}}{\left(\sum_i f_i\right)^2}$$

or

$$\frac{\partial y}{\partial \theta_k} = \frac{(w_k - y)}{\sum_i f_i} \frac{\partial f_k}{\partial \theta_k} \qquad (3.12)$$

For updating the centre parameters $c_k$ and $\sigma_k$ we use the appropriate
terms in place of θ. These are given by

$$\frac{\partial f_k}{\partial c_{ki}} = f_k \frac{(x_i - c_{ki})}{\sigma_{ki}^2} \qquad (3.13)$$

and

$$\frac{\partial f_k}{\partial \sigma_{ki}} = f_k \frac{(x_i - c_{ki})^2}{\sigma_{ki}^3} \qquad (3.14)$$

These equations can be chained to give the necessary update rule in the form

$$\theta_{t+1} = \theta_t - \alpha \nabla E$$

Since this procedure directly minimizes the error, the final result obtained
is generally more accurate, but the method is more computationally intensive
than the others. Good results are obtained by interlacing this centre update
rule with recalculation of the weights w by the equation (3.4).





Chapter 4 

Tree Based Representation 


In addition to the time taken for training the system, one also has to con- 
sider the ease of retrieval. Lazy learning techniques for instance, defer all 
processing till a prediction is to be made. Thus they have practically zero 
training expense, but retrieval costs are substantially higher, especially with 
large training sets. 

For the case of radial basis function networks with a large number of 
centres, often only a few of the centres have a significant contribution. A 
methodology which can eliminate centres which are insignificant for the cur- 
rent input, will save a lot of computation and will make the retrieval faster. 

Methods for accelerating memory-based learning were proposed by
Deng and Moore [29], who store the instances in a k-d tree, thereby
reducing the computational cost of finding the nearest neighbours.

The current implementation of the radial basis function tree partitions
the input space recursively, such that within a partition only a handful
of the centres are active. Note that the referred implementation [29]
partitions the training instances, while this one partitions the input
space.


4.1 K-d Tree assisted Memory based learning: 

A k-d tree is a special binary tree data structure used for storing multiple
instances of a k-dimensional vector $x \in R^k$ in a systematic manner. Each
non-terminal node, having a right and a left child, splits the exemplar set into
2 disjoint sets on the basis of a particular component of the vector x. The
component chosen for splitting is called the splitting variable and the value
used for splitting is the splitting value. All instances of the exemplar set
having the value of the splitting variable higher than the splitting value
chosen are located in the left sub-tree and the others in the right sub-tree.
K-d trees are popular for accelerating multi-dimensional search, as binary
search trees are for one-dimensional search.

Formally, let E signify the exemplar set of instances $x_i$ with i ranging
from 1 to N. Then the root node of the tree T is defined as a set

$$T_{root} = \{x \mid x \in E\}$$

$$T_{root} = T_{left\ child} \cup T_{right\ child}$$

$$T_{left\ child} = \{x \mid (x \in E) \cap (x_i > split_i)\}$$

$$T_{right\ child} = \{x \mid (x \in E) \cap (x_i \le split_i)\}$$


4.1.1 Kernel Regression: 

Kernel regression is a special case of memory based learning, very similar to
radial basis function networks. In this case all the N past training data
are used to produce the output for the new query point $x_q$. The output is
given as

$$y(x_q) = \frac{\sum_{i=1}^{N} K(dist_i(x_q))\, y_i}{\sum_{i=1}^{N} K(dist_i(x_q))}$$

where $K(dist_i(x_q))$ is the kernel function depending on the distance metric
$dist_i(x_q)$ of $x_q$ from the training instance $x_i$. Comparing this with radial
basis function networks, we see that this is the same as an RBF network with
all the past training cases acting as centres, using the same distance metric
for all the cases and the Gaussian as the kernel function.

Figure 4.2: iso-response elliptical bands superposed over distribution of data

Thus for hyper-ellipsoidal bands of narrow thickness defined over the
training instances, as shown in Figure 4.2, the value of the kernel function
would not vary much. The kernel functions can then be approximated by the
kernel function value at the centre of the band. Thus if we divide the space
into l such bands of almost constant kernel value, the regression can be
obtained as

$$y(x_q) = \frac{\sum_{i=1}^{l} K(distmid_i(x_q)) \sum_{j \in band\ i} y_j}{\sum_{i=1}^{l} n_i\, K(distmid_i(x_q))} \qquad (4.2)$$

Note that now we have to sum over $l \ll N$ terms; $n_i$ is the number of training
cases within band i, and $K(distmid_i(x_q))$ is the kernel value at the centre of
band i.


4.2 Radial Basis Function trees: 

Deng and Moore proposed storing every instance of the training set on a
k-d tree. Given a query point, it was possible to identify the minimum and
maximum response of the kernel function within each partitioned hyper-rectangle.
This information could then be used to decide whether to traverse that branch
further. Such a method of pruning the search cannot be used for accelerating
radial basis function networks. This difficulty arises from the fact that each
of the centres uses a different distance metric. Had the matrix [Σ] used in
the distance metric formulations of the centres (equation 3.1) been the same
for all the centres, the same technique could have been applied.

Because of this difficulty a different approach is taken. The
entire input space is taken to be a finite hyper-rectangle and is divided into
a number of blocks, the simplest division being a grid partitioning. For
each of these blocks it is necessary to identify the set of radial basis units
which have a significant output at any point within it; these are maintained
as a list. A radial basis function unit $f_i$ is considered insignificant if

$$\forall x \in \text{block}:\ f_i(x) < \epsilon \qquad \text{or} \qquad \max_{x \in \text{block}} f_i(x) < \epsilon \qquad (4.3)$$

The efficiency of this algorithm hinges on the ease with which we can
calculate the expression (4.3); note, however, that this has to be done only
once. For small enough blocks the effective number of centres $n_{eff}$
within the block would be small. Provided we can easily identify the block
given a query point, this will sufficiently reduce recall expenses.

4.2.1 Finding maximum Response of a RBF unit within 
a hyper rectangle: 

For a centre $c_j$ of a radial basis unit located within the hyper-rectangle, the
maximum will be at the centre itself. For centres outside the hyper-rectangle
the point where the maximum occurs needs to be found.
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Figure 4.3: Location of the minimum distance norms for a given input space 
and different radial basis centres. 

Since equi-response lines of RBF units are hyper-ellipsoids, the process of 
finding the point with maximum response reduces to finding the point of in- 
tersection of the hyper-rectangle with the smallest of the system of concentric 
hyper-ellipsoids around the centre Cj. 

It can be noted that this point will be located only on the boundary of the
hyper-rectangle, either on a vertex or on a hyper-plane forming a face of the
hyper-rectangle. Moreover, the point will be located on a hyper-plane only
when the ellipsoid makes a tangential contact with the hyper-rectangular block.

For axis parallel ellipsoids such tangential contact can occur only along
the major and minor axes; thus such points will lie on the intersection of
the axes with the boundary of the hyper-rectangle (Figure 4.3). For inclined
ellipsoids one can identify a system of lines through the centre along which
the points of tangency can lie; thus one needs to search only the points of
intersection of these lines with the boundary of the hyper-rectangle.

However, in the case of axis parallel ellipsoids, as used in the
current implementation of radial basis units, the identification of the point
of maximum response is computationally trivial. Consider that the
hyper-rectangle is specified by the lower and upper limits of each dimension,
$L_i$ and $H_i$, and that the components of the centre of the radial basis unit
are given as $c_i$. Let the point of maximum response be p. The components of p
are obtained as

$$p_i = \begin{cases} c_i & \text{if } L_i \le c_i \le H_i \\ H_i & \text{if } c_i > H_i \\ L_i & \text{if } c_i < L_i \end{cases}$$
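
The rule amounts to clamping each coordinate of the centre to the limits of the block, and can be sketched in C as below. The names `nearest_point` and `min_scaled_distance` are hypothetical. The second function returns the smallest value q of the axis parallel distance metric over the block. Assuming a Gaussian unit of the form exp(-q/2), the maximum response is below ε exactly when q exceeds -2 ln ε, so the exponential need not be evaluated during the significance test.

```c
#include <stddef.h>

/* Point p of the hyper-rectangle [lo, hi] at which an axis parallel unit
 * centred at c responds most strongly: each coordinate of c is clamped. */
void nearest_point(size_t d, const double *lo, const double *hi,
                   const double *c, double *p)
{
    for (size_t i = 0; i < d; ++i) {
        if (c[i] > hi[i])      p[i] = hi[i];
        else if (c[i] < lo[i]) p[i] = lo[i];
        else                   p[i] = c[i];
    }
}

/* Smallest value over the block of sum_i ((x_i - c_i) / sigma_i)^2. */
double min_scaled_distance(size_t d, const double *lo, const double *hi,
                           const double *c, const double *sigma)
{
    double q = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double p = c[i] > hi[i] ? hi[i] : (c[i] < lo[i] ? lo[i] : c[i]);
        double z = (p - c[i]) / sigma[i];
        q += z * z;
    }
    return q;
}
```

For the unit square and a centre at (2, 0.5) with widths (0.5, 1), the nearest point is (1, 0.5) and the minimum scaled distance is 4.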


4.2.2 Partitioning: 

The previously mentioned strategy of grid decomposition is inefficient, as it
constrains all the blocks to be of equal size. Unequal blocks,
on the other hand, are expected to require fewer partitions. The
methodology followed is to recursively partition the input space
hyper-rectangle along different dimensions till the effective number of centres
$n_{eff}$ for the blocks falls below a threshold. The most intuitive
representation of such a partitioning is a k-d tree.

The objective of the k-d tree is to reduce the number of effective centres
within each partition. This is done by suitably choosing the splitting
dimension and the splitting value. Given a splitting dimension, the optimum
choice of value is the one which follows the minimax criterion, that is, which
minimizes the maximum $n_{eff}$ of the two partitions produced

$$\text{split value} = \min_{d} \max n_{eff}$$

Once the mechanism for selecting the splitting value is specified, we can
identify the splitting dimension as that dimension which results in the
maximum reduction. The optimum splitting value can be obtained by executing
a search along each of the dimensions, but that will be computationally
intensive.

To obtain the splitting value we adopt two heuristics

1. Median value split: This ensures an equal number of centres in each
partition, but the effective number of centres for each block may not be
adequately reduced, as a result of contributions from centres beyond its
boundaries.

2. Select the splitting hyperplane so as to minimise the response of all
centres on it. This reduces to selecting that hyperplane which maximises
the sum total of the different distance metrics used by the centres.

We evaluate both heuristics and adopt the better one. It has been seen that
the latter approach gives better results more often.

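For axis parallel radial basis units the distance metric is of the form

$$\sum_k \frac{(x_k - c_k)^2}{\sigma_k^2}$$

(refer to (3.1)). The sum of this distance is to be maximised while ensuring
that the hyperplane remains within the convex hull of the centres. This
constrained optimisation problem, for the splitting value
$S_i = \sum_k w_k c_{k,i}$ along dimension i,

$$\max_{w} \sum_{k \in \text{centres on list}} \frac{(c_{k,i} - S_i)^2}{\sigma_{k,i}^2} \qquad \text{subject to} \qquad \sum_k w_k = 1$$

can be formalised with the help of Lagrangian multipliers as

$$\max_{w} \sum_k \frac{1}{\sigma_{k,i}^2} \left( c_{k,i} - S_i \right)^2 + \lambda \left( \sum_k w_k - 1 \right)$$

Differentiating with respect to $w_j$ we get

$$-2\, c_{j,i} \left( \sum_k \frac{c_{k,i}}{\sigma_{k,i}^2} - S_i \sum_k \frac{1}{\sigma_{k,i}^2} \right) + \lambda = 0 \qquad \forall j$$

$$\sum_k \frac{c_{k,i}}{\sigma_{k,i}^2} - S_i \sum_k \frac{1}{\sigma_{k,i}^2} = \frac{\lambda}{2\, c_{j,i}}$$

This is only possible when expressions on both sides of the equality are
individually zero. Thus we obtain the splitting value

$$S_i = \sum_k w_k c_{k,i} = \frac{\sum_k c_{k,i} / \sigma_{k,i}^2}{\sum_k 1 / \sigma_{k,i}^2}$$

Thus these two heuristics consist of selecting the median or a weighted
mean of the centre values. The splitting variable can either be chosen by
evaluating each of the variables and selecting the best, or by the popular
heuristic of choosing the variable with the maximum spread.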
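
The weighted mean heuristic can be sketched in C as follows. The sketch assumes that each centre is weighted by the inverse square of its width along the splitting dimension. The function name `weighted_mean_split` is hypothetical.

```c
#include <stddef.h>

/* Weighted mean of the centre coordinates along one dimension, with
 * weights 1 / sigma^2 (an assumed form of the second heuristic).
 * c[k] and sigma[k] are the coordinate and width of centre k along the
 * splitting dimension; the caller must pass n > 0 and non-zero widths. */
double weighted_mean_split(size_t n, const double *c, const double *sigma)
{
    double num = 0.0, den = 0.0;
    for (size_t k = 0; k < n; ++k) {
        double w = 1.0 / (sigma[k] * sigma[k]);
        num += w * c[k];
        den += w;
    }
    return num / den;
}
```

Narrow units pull the split towards themselves: for centres at 0 and 4 with widths 1 and 2, the split value is 0.8 rather than the midpoint 2.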

The node structure can be represented as

    struct kdtree {
        int split_index;        /* for identifying the split variable */
        float splitval;         /* the splitting value */
        centre corner1;         /* specifying the minimum corner of the hyper-rectangle */
        centre corner2;         /* specifying the maximum corner of the hyper-rectangle */
        int n_eff;              /* records the number of effective RBF units */
        char *list;             /* maintains the list of effective units */
        struct kdtree *left;
        struct kdtree *right;
    } node;
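
Recall with such a tree amounts to descending from the root to the leaf block that contains the query point and then evaluating only the units on that leaf's list. A minimal C sketch of the descent follows. The type `rbf_node` is a simplified stand-in for the structure above, not the thesis code, and it follows the convention of section 4.1 that values above the splitting value go to the left child.

```c
#include <stddef.h>

/* Simplified node: leaves carry the indices of the RBF units that are
 * significant somewhere inside their block. Hypothetical layout. */
struct rbf_node {
    int split_index;                /* split variable, unused in leaves */
    double splitval;                /* splitting value                  */
    int n_eff;                      /* number of entries in list        */
    const int *list;                /* indices of the effective units   */
    const struct rbf_node *left;    /* x[split_index] >  splitval       */
    const struct rbf_node *right;   /* x[split_index] <= splitval       */
};

/* Descend from the root to the leaf block containing the query point x. */
const struct rbf_node *find_block(const struct rbf_node *node, const double *x)
{
    while (node->left != NULL && node->right != NULL)
        node = (x[node->split_index] > node->splitval) ? node->left
                                                       : node->right;
    return node;
}
```

The network output is then computed over the `n_eff` units of the returned leaf instead of over all the centres.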


The building of the radial basis tree can be done by calling the “build tree”
function recursively as

    Build_tree(start_node) {
        if (start_node.n_eff < threshold) exit;
        select split variable;
        n_eff_median  = evaluate(median split);
        n_eff_wtdmean = evaluate(weighted mean split);
        if (min(n_eff_median, n_eff_wtdmean) == start_node.n_eff) exit;
        if (n_eff_median < n_eff_wtdmean) splitval = median split;
        else splitval = weighted mean split;
        allocate(start_node.right);
        allocate(start_node.left);
        assign start_node.right.corner1, start_node.right.corner2;
        assign start_node.left.corner1, start_node.left.corner2;
        Build_tree(start_node.right);
        Build_tree(start_node.left);
    }
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Chapter 5 

Learning to shoot to goal 


Shooting to goal, which is an important behaviour for a soccer playing agent,
was chosen as the first task to be learned for this study. Radial basis
functions were used as function approximators. We have considered only the
shooting angle as the parameter to be learnt, assuming the shooter kicks with
the highest velocity. The task is simple in that only a single output
has to be learnt and that, being a single epoch event, it does not involve the
problem of temporal credit apportionment. However, the input space
dimensionality is still quite large, of the order of 10, and poses difficulties
for this problem.

For a shot aimed within an untended goal there is a finite probability
that the goal will be missed. This is compounded by the additional problem
of intelligent and reactive defender agents trying to stop the goal from being
scored.

Goal scoring can be defined as the intersection of two mutually independent
events, i.e. that the ball remains within the goal and that the ball is
not intercepted by any of the defenders including the goal-keeper.

$$P(\text{goal}) = P(\text{ball within goal}) \times P(\text{ball not intercepted})$$

The probability of the ball remaining within goal can be estimated as

$$P(\text{ball within goal}) = \int_{\text{left post}}^{\text{right post}} p(\theta)\, d\theta \qquad (5.1)$$

see Figure (5.1).

Figure 5.1: Noise in shooting angle

The probability of the ball not being intercepted can be expressed as

$$P(\text{ball not intercepted}) = \prod_{\text{defender } i} \left( 1 - P_{int}(i) \right) \qquad (5.2)$$

where $P_{int}(i)$ is the probability that the $i^{th}$ defender intercepts the ball.

The task, thus, is to maximize the probability of scoring a goal in the presence
of adversaries and the random perturbations introduced by the soccer-server,
by choosing a suitable shooting angle θ.
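
Under the independence assumption stated above, the scoring probability for one candidate angle combines (5.1) and (5.2) as in the following C sketch. The function name `goal_probability` is hypothetical, and the individual probabilities are assumed to have been estimated elsewhere.

```c
#include <stddef.h>

/* P(goal) = P(ball within goal) * prod_i (1 - P_int(i)).
 * p_int[i] is the interception probability of the i-th defender. */
double goal_probability(double p_within_goal, size_t n_def,
                        const double *p_int)
{
    double p_clear = 1.0;              /* probability that nobody intercepts */
    for (size_t i = 0; i < n_def; ++i)
        p_clear *= 1.0 - p_int[i];
    return p_within_goal * p_clear;
}
```

For example, a shot that stays within the goal with probability 0.8 and faces two defenders with interception probabilities 0.5 and 0.25 scores with probability 0.8 × 0.5 × 0.75 = 0.3.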

The optimal shooting angle was learnt through supervised learning,
using multiple simulations to obtain the optimum value.
Together with the optimum angle, the maximum probability of a successful
goal shot was also learnt. The latter is nothing but the value function V(S)
of the state and can be updated online as discussed in sections (2.2.1) and
(3.3). The RBF network was tuned to approximate these functions.

5.1 Generation of training through simulation 

For this particular case the state consists of the shooter (assumed to be in
possession of the ball), the goal keeper, and the two possible defenders. The
state S is thus represented as an eight tuple

$$\langle d_s, \theta_s, d_g, \theta_g, d_{d1}, \theta_{d1}, d_{d2}, \theta_{d2} \rangle$$

where d is the reciprocal of the distance from the shooter (for the goal
keeper and the defenders) or from the origin (for the case of the shooter).
θ corresponds to the angle subtended at the foot of the shooter by the rays
connecting the opponent and the origin; for the case of $\theta_s$ it corresponds
to the angle subtended by the shooter at the origin, see figure (5.2). The
action A is represented as a real number β in the range [0, 1], spanning
shots from the left goal post to the right goal post.

Figure 5.2: state variables of the shooting task

With a state consisting of eight real numbers the state space is huge. 
The enormity of the search space precludes exhaustive enumeration with a 
practical resolution. Thus it is necessary to distribute the training resource 
judiciously over the search space. Random initialisation of the training in- 
stance though tempting has been found to offer poor results [16]. 

5.2 Forming the training set: 

The state space being large, it is not possible to populate it evenly with
training instances at sufficient density. Training instances have been allocated
to the search space depending on the likelihood of the shooter experiencing
such cases and their relevance to the output (shooting angle). For instance,
it is intuitive that a defender behind the shooter can very rarely intercept a
shot, so such redundant training cases have been omitted as far as possible.

Figure 5.3: Seeding Training Instances

As shown in figure (5.3), the half field has been broken into an eight-celled
grid. Training cases were generated by first placing the shooter randomly
within all the segments. Following this, the case of a defender was
considered. Here the shooter is placed randomly in any of the columns and the
position of the defender is generated so that the defender is not placed in
a column behind the shooter's column. Similarly for the 2nd defender. In all
the cases the goalie, if present, is located randomly in a small area within
the penalty box. Ten shooting instances were generated for each configuration,
resulting in around 600 training cases.


5.3 Evaluation of goal scoring probabilities 

Given a shooting angle parameter β ∈ [0, 1] and the adversaries on the field as
described by the state S, the probability of scoring a goal is

$$P(\text{goal} \mid S, \beta)$$

The shot to the goal is represented by a shooting parameter β. A value
of β = 0 refers to a shot aimed at the left post and β = 1
refers to a shot aimed at the right post.

The probability of the ball remaining within the goal
and the probability that the ball is intercepted are calculated separately.
Each of these has been implemented as described below.

5.3.1 Probability of ball within goal 

Here the simulation is started with the ball being given a velocity at the
shooting angle under investigation. The state is updated using the model of
the soccer-server till the ball crosses the goal line or decays to a halt. At
the end of each simulation, depending upon whether the ball was within the
goal posts or not, the estimate is updated as

$$P(\text{within goal} \mid S, \theta)_{t+1} = P(\text{within goal} \mid S, \theta)_t + \left( 1 - P(\text{within goal} \mid S, \theta)_t \right) \cdot \frac{1}{t}$$

or

$$P(\text{within goal} \mid S, \theta)_{t+1} = P(\text{within goal} \mid S, \theta)_t - P(\text{within goal} \mid S, \theta)_t \cdot \frac{1}{t}$$

where t is the number of iterations. This is continued till convergence. The
variation of this value for different angles of shot from a fixed position is
shown in figure (5.4).
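
The update is the usual incremental form of a sample mean over simulated shots, and can be sketched in C as below. The function name `update_within_goal` is hypothetical, and t is assumed to count the simulated shots from 1.

```c
/* Running estimate of P(within goal) after the t-th simulated shot.
 * scored is non-zero when the ball ended within the goal posts. */
double update_within_goal(double p, unsigned t, int scored)
{
    return scored ? p + (1.0 - p) / (double)t
                  : p - p / (double)t;
}
```

Starting from any initial value, the estimate after t shots equals the fraction of those shots that stayed within the goal. For the outcomes hit, miss, hit, hit it passes through 1, 0.5, 0.667 and ends at 0.75.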


5.3.2 Probability of interception 

This routine has to take into account all possible behaviours of the
defenders, making it computationally more intensive than the previous one. As
before, the simulation is initialised with the shooter shooting at
a specified shooting angle. We need to find out the probability that the ball
can be intercepted by any of the defenders at any point on the trajectory of
the shot.

Normally we would have carried out numerous iterations to find this
probability. But since the number of dependent variables is higher, the
number of iterations necessary to find the probability, taking into account all
the possible variations of the variables, is expected to be huge. So a different
strategy is adopted to reduce the computation.

Figure 5.4: Scoring probability for an untended goal, shooting from (10,2).

We define $p_{ball}(d,t)$ as the probability distribution of finding the ball
at a distance d from the point of kick at time t.

Thus the probability that the ball is at a distance d from the point of
kick in the time interval [t, t+1] is obtained as

$$\int_{t}^{t+1} p_{ball}(d,t)\, dt$$

For the discrete model of the soccer server this information is approximated
as a piecewise constant curve and stored as a sparse matrix p_ball[d][t]
for an adequate range of values of t and d. This pre-calculation saves on
generating these values at each iteration, at the expense of memory.

Similarly for the defenders, $p_{def_i}(d,t)$ is the probability distribution
of finding the defender at a distance d from his initial point at time t. It
is assumed that the defender is running with the maximum possible speed.

This information is stored, subject to the piecewise constant assumption,
as a sparse matrix p_def_i[d][t].

The entire matrix can be visualised as a 3-dimensional surface built by
stacking these curves along the distance axis, as shown in Fig 5.5 and Fig 5.6.
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Figure 5.5: Probability of finding ball at a distance d at time t 

The nature of the second plot is significantly different because the defenders
can replenish the decay of their speed, and the noise added to their motion
is lower than that of the ball.

For a given shooting angle θ the ball is checked for interceptability along
all the points of the trajectory of the shot. The points on the trajectory are
given by the vector

$$\begin{bmatrix} x_l \\ y_l \end{bmatrix} = \begin{bmatrix} x + l \cos\theta \\ y + l \sin\theta \end{bmatrix}$$

where $[x, y]^T$ is the initial location of the ball (identical to the location
of the shooter) and l is a parameter.

For a given shooting angle θ, at every value of l we evaluate the distance
$dist_i$ of all the defenders from the point $[x_l, y_l]^T$. This distance is used
to find the probability that the defenders can reach it in time T. This is
calculated as

$$P_{def_i}(dist_i, T) = \sum_{t=0}^{T} p_{def_i}(dist_i, t)$$

Figure 5.6: Probability of finding defender at distance d at time t

Similarly for the ball we calculate

$$P_{ball}(l, T) = \sum_{t=0}^{T} p_{ball}(l, t)$$

Thus the probability of intercepting the ball at $[x_l, y_l]^T$ in time T is given as

$$P_{int}(l, T, \theta) = P_{def_i}(dist_i, T) \times P_{ball}(l, T)$$

To find the probability of interception of the ball at $[x_l, y_l]^T$ we have to
consider the probability for T ranging from 0 to ∞. We take this probability as

$$P_{int}(l, \theta) = \max_{T} P_{int}(l, T, \theta) \qquad (5.3)$$
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Figure 5.7: Schematic diagram showing maximum interception probability 
over time at different positions along the direction of shot. 



Figure 5.8: Schematic diagram showing maximum interception probability 
over the length of shot. Each point corresponds to the maximum obtained 
from the earlier figure (5.7). 


This quantity is evaluated at all the values of I till the goal is reached a.nd 
the probability that the ball is intercepted for the given shooting angle & is 
calculated as 


Pint{0) - mp;Pint(i) 

The probability of interception taking into consideration all the defenders
is given by the expression (5.2). The max operators in the equations (5.3)
and (5.4) achieve the ideal opponent modelling.

No. of     Clustering          Mean absolute     Cross-validation $\sigma_{GCV}$
centres                        error in angle    rmse         mad
60         hard k-means        0.142881          0.23050      1.461
120        hard k-means        0.085057          0.208490     1.950
60         soft k-means        0.175498          0.272719     1.381
120        soft k-means        0.132576          0.275045     1.616
60         gradient descent    0.102704          0.205560     1.778

Table 5.1: Result of training of optimum angle
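The nested maximisation in equations (5.3) and (5.4) can be sketched for a single defender as follows. This is only an illustration: the callables `p_def`, `p_ball` and `dist`, and the discretised grids `lengths` and `times`, are hypothetical stand-ins for the simulated probabilities and are not taken from the original implementation. The combination over all defenders (expression 5.2) is not shown.

```python
from typing import Callable, Sequence


def interception_probability(
    p_def: Callable[[float, float], float],
    p_ball: Callable[[float, float], float],
    dist: Callable[[float], float],
    lengths: Sequence[float],
    times: Sequence[float],
) -> float:
    """Return max over l and T of P_def(dist_l, T) * P_ball(l, T).

    p_def(d, T): assumed probability that the defender covers distance d by time T.
    p_ball(l, T): assumed probability that the ball has reached parameter l by time T.
    dist(l): distance of the defender from the point at parameter l on the shot.
    """
    best = 0.0
    for l in lengths:
        d = dist(l)
        # Inner max over T: the defender picks the most favourable time.
        p_l = max(p_def(d, T) * p_ball(l, T) for T in times)
        # Outer max over l: the defender picks the most favourable point.
        best = max(best, p_l)
    return best
```

Taking the maximum at both levels is what makes the opponent "ideal" in this sketch: the shooter is always evaluated against the defender's best possible choice of interception point and time.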


5.4 Implementation and results 

The plots of the probabilities of goal scoring generated by simulation, as
explained in section (5.3), are shown for different cases in the figures
(5.9, 5.4). They show the widening of the “goal window” as the shooter
comes nearer to the goal. The effect of the side on which the shooter and
the goal keeper stand is also shown in the next two plots (5.10).

The RBF network was trained to learn the optimum shooting angle considering
at most three defenders, including the goal keeper. The input was the
8-tuple indicated in section [5.1]; the output was a parameter ranging from
0 to 1, where 0 signifies a shot to the left goal post and 1 a shot
directed at the right goal post.

It was decided that an angular accuracy that could divide the goal line
into 5 blocks was sufficient for the purpose. Thus an absolute error in
angle value of less than 0.2 was enough, but a larger network with 120
centres was also investigated to see its effect on the training instances.
Table [5.1] gives the values of the absolute and generalized cross
validation errors for the networks trained with different numbers of
centres. Note that the number of centres signifies the effective number of
states into which the problem space was broken.
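The accuracy requirement above amounts to simple arithmetic: with the output parameter in [0, 1] and 5 blocks, each block is 0.2 wide. A minimal sketch follows; the function names are hypothetical and the convention of folding the right post into the last block is an assumption.

```python
def block_width(n_blocks: int = 5) -> float:
    """Width of one goal-line block in units of the shot parameter."""
    return 1.0 / n_blocks


def goal_block(shot_param: float, n_blocks: int = 5) -> int:
    """Map a shot parameter in [0, 1] to a block index 0 .. n_blocks - 1.

    0.0 is the left goal post and 1.0 the right goal post, as in the text.
    """
    if not 0.0 <= shot_param <= 1.0:
        raise ValueError("shot parameter must lie in [0, 1]")
    # The right post (1.0) is folded into the last block.
    return min(int(shot_param * n_blocks), n_blocks - 1)
```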

In addition to these angle values the value function, which in this case
is the same as the scoring probability, was also learnt. The results are
shown in table [5.2].

Hard k-means clustering is seen to perform better for this particular
domain. The poor performance of soft clustering can be attributed to the
higher degree of generalization, or “smearing effect”, produced by soft
clustering. The centres obtained by hard clustering were further enhanced
by a gradient descent algorithm for the case of 60 centres; the resulting
error plots are shown in figure (5.11). For the case of 120 centres the
error levels were regarded as sufficiently low to make gradient descent
unnecessary. It was observed that gradient descent did not change the
location of the centres c by much; in other words, the centre locations
produced by k-means clustering are quite close to the optimum. The better
performance of gradient descent is the outcome of tuning the widths, which
initially had been set proportional to the radii of the clusters.

(a) Probability of beating the goal-keeper stationed at the centre of the
goal line by a shooter 5m in front.
(b) Goal scoring probability for the same configuration as on the left.
(c) Probability of beating the goal keeper stationed at the centre of the
goal line, by a shooter at 8m in front.
(d) Goal scoring probability for the same configuration as on the left.

Figure 5.9: Scoring probabilities from various locations I

(a) Probability of beating the
(c) Probability of beating the goal keeper for the same location of the
shooter as above, but with the goal-keeper 2m on the right.
(d) Goal scoring probability for the same configuration.

Figure 5.10: Scoring probabilities from various locations

No. of     Clustering          Mean absolute          Cross-validation $\sigma_{GCV}$
centres                        error in probability   rmse         mad
60         hard k-means        0.143372               0.342407     2.122
120        hard k-means        0.072632               0.249786     2.627
60         soft k-means        0.155177               0.355177     2.03
120        soft k-means        0.121895               0.346897     2.217
60         gradient descent    0.068272               0.230501     3.00

Table 5.2: Result of training scoring probabilities

Figure 5.11: Error reduction by steepest gradient after hard k-means
clustering. (Horizontal axis: No. of iterations.)

One of the advantages of RBF networks is that they offer a comprehensible
model. The rules (or centres) learnt for the optimum shooting angle are
shown pictographically in figures (5.12 to 5.17). The shooter is
represented by a light coloured circle; the opponents are shown as a white
ring around a black circle. The optimum shooting angle proposed by the rule
is shown by a directed straight line. The figures have been arranged
roughly in order of increasing level of opposition. One thing that can be
noted is that the rules are rather extremal; for example, in figure (5.12
bottom) the rule proposes a shot to the goal post. This can be explained by
the fact that the final output, being averaged over all the other rules,
gets somewhat moderated, and the rules are trying to compensate for this.
Moreover the effect of the goal posts is localised to a fairly small area
around each goal post (figure 5.4), making near goal post shots very
effective.
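The moderating effect of averaging can be illustrated with a small normalised RBF network. This is a sketch under assumptions: Gaussian units and an activation-weighted average of the rule outputs are assumed here, and the two example rules are invented purely for illustration; the network equations actually used are those of chapter 3.

```python
import math
from typing import Sequence


def normalised_rbf_output(
    x: Sequence[float],
    centres: Sequence[Sequence[float]],
    widths: Sequence[float],
    rule_outputs: Sequence[float],
) -> float:
    """Activation-weighted average of the rule outputs (assumed Gaussian units)."""
    activations = []
    for centre, width in zip(centres, widths):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        activations.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    total = sum(activations)
    if total == 0.0:
        raise ValueError("input is too far from every centre")
    return sum(a * y for a, y in zip(activations, rule_outputs)) / total


# Two hypothetical rules: one proposes the left post (0.0), one the middle (0.5).
centres = [[0.0], [1.0]]
widths = [0.5, 0.5]
rule_outputs = [0.0, 0.5]
at_first_centre = normalised_rbf_output([0.0], centres, widths, rule_outputs)
```

Even at the exact centre of the first rule, `at_first_centre` is pulled away from that rule's own output of 0.0 by the neighbouring rule. A rule that wants the network to produce an extreme value therefore has to propose something at least as extreme.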

The RBF tree was implemented for both the 120 and the 60 centre cases. The
generated tree structure (not all the nodes have been fully expanded, for
brevity) is shown in figure (5.18). Partitioning was stopped for nodes with
$n_e$ equal to 5.




Figure 5.12: Learnt Rules I

Figure 5.13: Learnt Rules II

Figure 5.15: Learnt Rules IV

Figure 5.16: Learnt Rules V

Figure 5.17: Learnt Rules VI

Figure 5.18: A partial view of the RBF tree generated for the case of 120
centres. The figures represent the number of effective units at the nodes.



Chapter 6

Learning to intercept ball 


Interception of a moving ball constitutes an important skill in the
robot-soccer domain, as it is necessary for a number of collaborative as
well as adversarial tasks like receiving a pass from a team member, saving
a goal, stealing the ball from an opponent etc. This task can also be
directly related to that of choosing optimal passing shots, in which case
the desired speed and direction of the pass is decided based on the
interception probability by one's teammates. However, this task has not
been adequately studied in the past. Interception has been learnt mainly
for cases where the ball is directed straight to the interceptor [6], and
the presence of optimal opponents has been neglected.


6.1 Input vector and the generation of training 
cases 

The input vector considered for this task consisted of a 5-tuple
$\langle d_b, \omega, \theta_b, d_o, \theta_o \rangle$, where the variable
$d$ corresponds to the reciprocal of the distance from the agent (or from
the origin for $d_b$) and the variable $\theta$ corresponds to the angular
distance, see figure (6.1). The subscript $b$ refers to the ball and the
subscript $o$ refers to the opponent.

In learning the interception task the variables $d_o$ and $\theta_o$ were
taken in the reference frame of the agent, whereas for the passing module
they can be taken in the frame of the ball, i.e. the passer.

A total of 6600 training cases were generated by considering different
combinations of these variables. Each of the variables was distributed in
a noise contaminated geometric progression between the highest and lowest
values fixed a priori.

Figure 6.1: State variables of interception task
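A noise contaminated geometric progression of this kind could be generated as in the sketch below. The multiplicative uniform noise, its magnitude, the clipping to the fixed range and the function name are all assumptions made for illustration; the text does not specify how the noise was applied.

```python
import random
from typing import List, Optional


def noisy_geometric_grid(
    lowest: float,
    highest: float,
    n: int,
    noise: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return n values in geometric progression from lowest to highest,
    each perturbed by a relative uniform noise of at most +/- noise."""
    if n < 2 or lowest <= 0.0 or highest <= lowest:
        raise ValueError("need n >= 2 and 0 < lowest < highest")
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    ratio = (highest / lowest) ** (1.0 / (n - 1))
    values = []
    for k in range(n):
        base = lowest * ratio ** k
        jitter = 1.0 + rng.uniform(-noise, noise)
        # Keep the perturbed value inside the range fixed a priori.
        values.append(min(max(base * jitter, lowest), highest))
    return values
```

A geometric spacing places more samples near the lower end of the range, and the noise avoids training only on an exact lattice of states.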

6.2 Evaluation of interception probability and 
Q values. 

Instead of simulating each of the 6600 cases thousands of times, the same
probabilities computed in [5.3] were utilized to arrive at the
probabilities of interception and the associated Q values.

The action space $A_i$ consisted of 10 macro actions numbered 1-10 for 3
different interception velocities of the agent (3 m/s, 6 m/s, 10 m/s). The
$i^{th}$ action with velocity $v$ referred to intercepting the ball at the
location where it is expected to be at the $i^{th}$ time step; $i$, as
mentioned, could take any value from 1 to 10. These actions are called
macro actions as they continue to be executed till their exit criterion is
reached, as opposed to ordinary actions, which are selected at every time
step. This macro-action representation has been found to significantly
expedite the learning process [31].

6.2.1 Rewards and penalties:

The behaviour learnt by the agent is entirely guided by the incentives
provided in the form of the reward function. The required behaviour for
this task demanded that the agent should intercept the ball before the
opponent. In this case the opponent was modelled to behave optimally. Thus
a reward of 1 was given for a successful interception and a penalty of 0.9
was given in case the opponent captured the ball before the agent.

The magnitude of the penalty has been kept lower than the reward because,
if both of them were equal, there would be no incentive for the agent to
try to intercept a ball which the opponent is more likely to capture. Thus
by reducing the magnitude of the punishment the agent was given a weakly
optimistic view of the world. It was also desired that the agent capture
the ball in minimum time. The soccer server penalizes each dash by a value
corresponding to the strength of the dash. Thus a penalty inversely
proportional to the dash velocity $v_d$ was applied.

The probability that the agent can intercept the ball was found for every
point on the ball’s trajectory up to 10 simulation steps, and similarly for
the defender. In terms of these probabilities, $P_{intercept\,agent}$ and
$P_{intercept\,opponent}$, the net Q value could be calculated as

$$Q(i, v_d) = \frac{[\ldots]}{1-\gamma}\,\bigl(P_{intercept\,agent} - 0.9\,P_{intercept\,opponent}\bigr)$$
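The sketch below shows how a net value could be assembled from the two interception probabilities and compared across macro actions. The reward of 1, the penalty of 0.9 and the three velocities are from the text. The discount factor `gamma`, the constant `dash_cost`, the reading of the action space as 10 x 3 combinations and the way the terms are combined in `net_value` are assumptions made for illustration, not the formula used in this work.

```python
from itertools import product
from typing import Dict, Tuple

STEPS = range(1, 11)             # intercept at time step i = 1..10
VELOCITIES = (3.0, 6.0, 10.0)    # interception velocities in m/s
REWARD = 1.0                     # successful interception
PENALTY = 0.9                    # opponent captures the ball first

Action = Tuple[int, float]


def net_value(p_agent: float, p_opponent: float, step: int, velocity: float,
              gamma: float = 0.95, dash_cost: float = 0.05) -> float:
    """Assumed form: discounted expected outcome minus a cost that is
    inversely proportional to the dash velocity."""
    outcome = REWARD * p_agent - PENALTY * p_opponent
    return (gamma ** step) * outcome - dash_cost / velocity


def best_macro_action(p_agent: Dict[Action, float], p_opponent: Dict[int, float],
                      gamma: float = 0.95, dash_cost: float = 0.05) -> Action:
    """Pick the (step, velocity) pair with the highest net value."""
    return max(
        product(STEPS, VELOCITIES),
        key=lambda a: net_value(p_agent[a], p_opponent[a[0]], a[0], a[1],
                                gamma, dash_cost),
    )
```

With equal probabilities for agent and opponent the outcome term is still slightly positive (0.5 - 0.45), which is the weakly optimistic bias described above.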


6.3 Implementation and Results 

The radial basis network was trained to learn the Q values of the macro
actions. The results are summarized in table [6.1].

As in the goal shooting case, a pictorial representation of the behaviours
learnt for intercepting the ball is included. Unlike the previous case,
here only the $Q(S \times A)$ model was learnt. The figures represent the
configuration for which the Q value is maximum.

As before, the agent is shown as a light coloured circle whereas the
opponent is represented by a light ring over a dark interior. The state
space representation used for this task is based on relative coordinates
and so is independent of the actual location of the ball, but for
convenience the ball is assumed to start from the centre. The light arrow
signifies the distance the ball can cover in 10 time steps if it is not
intercepted, and gives an idea of the velocity.

Figure 6.2: Plots of Q values for various configurations. For the bottom
case note how the Q value falls off sharply because of the presence of an
opponent.

Figure 6.3: Rules Learnt for Interception I

Figure 6.4: Rules Learnt for Interception II

Figure 6.5: Rules Learnt for Interception III

Figure 6.6: Rules Learnt for Interception IV

Figure 6.7: Rules Learnt for Interception V

Figure 6.8: Rules Learnt for Interception VI


Chapter 7 

Derived Behaviours 


Following the subsumption architecture explained in chapter (1), the
typical use of these modules, as envisaged by a number of researchers [3],
is to form the lower rungs of a complete learning agent, see figure (7.1).
A meta level gating function is required to activate the most appropriate
behaviour depending on the state. The learnt modules can not only form
these elemental behaviours but also serve as a basis for bootstrapping
other novel behaviours.

7.1 Behaviours from goal Shooting task 

The learnt RBFN can act as domain knowledge from which other important
tasks can be derived. The skills that can be obtained from the goal
shooting task include optimal strategic positioning of the goalie and the
defenders.

The learnt RBFN returns the optimal goal shooting point and the associated
probability of scoring. This information can be used within the framework
of an ideal opponent model (in this case the shooter) to select the best
position for the goal keeper.

A function minimization method like gradient descent can be used to
minimize the goal scoring probability. Gradient descent is especially
attractive as the RBFN model is analytically differentiable. The
differential coefficients with respect to the state vector x are similar to
those with respect to the centre c, already derived in section [3.5.4], see
equation (3.13). A complete gradient descent procedure may prove too
expensive to undertake online, so one might resort to a fixed number of
iterations of gradient descent.

Figure 7.1: Hierarchy of Behaviours (diagram label: INTERCEPT TO PASS ETC.)

As an approximation one might consider the RBF outputs to be constant, in
which case the minimality criterion reduces to a linear equation in a
single variable. The accuracy of this can be improved by repeated solution
of the associated single variable linear equation. A much simpler solution
is to calculate the goal scoring probabilities at a fixed number of goal
keeper locations and select the minimum. The same strategy can be used to
consider the position of the defenders near the goal.
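The simplest of these options, evaluating a fixed number of goal keeper locations and taking the minimum, can be sketched as follows. The callable `scoring_probability` is a hypothetical wrapper around the learnt RBFN with the rest of the state held fixed, and the one-dimensional keeper position along the goal line is an assumption made to keep the example short.

```python
from typing import Callable, List


def candidate_positions(left: float, right: float, n: int) -> List[float]:
    """Return n evenly spaced keeper positions between left and right."""
    if n < 2:
        raise ValueError("need at least two candidate positions")
    step = (right - left) / (n - 1)
    return [left + k * step for k in range(n)]


def best_keeper_position(
    scoring_probability: Callable[[float], float],
    candidates: List[float],
) -> float:
    """Pick the candidate position that minimises the shooter's scoring probability."""
    return min(candidates, key=scoring_probability)
```

The cost is one network evaluation per candidate, so the number of candidates directly trades accuracy against online computation time.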


7.2 Behaviours from Interception task 

An elemental skill derivable from the interception module is passing. This
does not involve learning any new concept; only the state has to be
transformed from the recipient centric model to a passer centric model. In
the current implementation a ball centric model was used for the RBF
network. Depending upon the task desired, either the “recipient” or the
“passer” centric model has to be preprocessed to obtain a ball centric
representation at the time of use.

A number of “stitched” tasks of the type “passing for X” and “intercepting
for X” can also be undertaken with the learnt model. Here “X” can represent
any other elemental subtask, for example shooting to goal. In these cases
the action selected should maximize the probability of success of the total
task. For example, while “intercepting to shoot to goal” the agent will try
to intercept the ball where the total expected reinforcement, i.e. from
capturing the ball as well as shooting from that location, is maximized.
This quantity is nothing but the product of the outputs of the learnt
models.
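Selecting an action for a stitched task such as “intercepting to shoot to goal” can be sketched as a maximisation of that product. The callables `q_intercept` and `p_score` are hypothetical wrappers around the two learnt models, and treating both outputs as probability-like values in [0, 1] is an assumption.

```python
from typing import Callable, Iterable, TypeVar

P = TypeVar("P")  # a candidate interception point (any representation)


def best_stitched_action(
    candidates: Iterable[P],
    q_intercept: Callable[[P], float],
    p_score: Callable[[P], float],
) -> P:
    """Return the candidate that maximises the product of the two model outputs."""
    return max(candidates, key=lambda c: q_intercept(c) * p_score(c))
```

For example, a point that is easy to reach but offers a poor shot can lose to a point that is somewhat harder to reach but offers a much better shot.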

Apart from passing to a teammate, the agents may also choose to dribble
forward with the ball. An agent dribbles by repeatedly kicking and
intercepting the ball while moving towards the opponent goal. It is desired
that the agents penetrate the opponent area as fast as possible without
losing control of the ball. This requires the agent to kick hard when the
threat of interception by opponents is low and vice versa.

This task may also be derived on the basis of the tasks already learnt,
using the trained RBF networks to provide the necessary reinforcement for
this multi-epoch task. For every action undertaken by the agent, the reward
should consist of the goal shooting probability from the point reached by
the ball, the probability of intercepting its own shot and a negative
reward given on the basis of the interception probability of opponents.



Chapter 8 
Conclusion 


A novel approach using RBF networks was applied to the simulated soccer
competition domain. The advantages offered are fast training,
comprehensibility in the form of rules and, most of all, a powerful
representation that can be modified incrementally (online) without
distorting past training sets. This includes not only the addition of new
data but even the addition of new centres or rules. In order to maintain
optimality one normally needs to retune the network from scratch; here the
same can be achieved with a computational effort lower by an order of
magnitude. A tree based algorithm was also developed to reduce the recall
time, thereby making it feasible to utilise much larger RBF network models.

One of the issues that can be investigated in greater detail is the choice
of clustering method. The study implemented hard and soft forms of k-means
clustering together with a gradient descent method for finding the centres.
In terms of results gradient descent outperforms the other two methods, but
in terms of time and computational effort it is more expensive. Other
emerging methods like cooperative competitive genetic algorithms [32] for
finding the centres have also been tried but need further study.